Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R pacakges. If they have yet to be installed, we will install the R packages and load them onto R environment.

packages = c('tidyverse','ggdist','gghalves')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}

The code chunk below imports exam_data.csv into R environment using read_csv() function of readr package.

exam_data <- read_csv("./data/Exam_data.csv")

ggplot2

Aethetics

The code chunk on the right add the aesthetic element into the plot. Notice that ggplot includes the x-axis and the axis’s label.

ggplot(data=exam_data, aes(x= MATHS))

Geometric Objects: geom_bar

The code chunk below plots a bar chart.

ggplot(data=exam_data,
aes(x=RACE)) +
geom_bar()

Geometric Objects: geom_dotplot

In a dot plot, the width of a dot corresponds to the bin width (or maximum width, depending on the binning algorithm), and dots are stacked, with each dot representing one observation.

ggplot(data=exam_data,
  aes(x = MATHS)) +
  geom_dotplot(dotsize = 0.5)

The code chunk below performs the following two steps:

  • scale_y_continuous() is used to turn off the y-axis, and
  • binwidth argument is used to change the binwidth to 2.5.
ggplot(data=exam_data, aes(x = MATHS)) + 
  geom_dotplot(binwidth=2.5, dotsize = 0.5) + 
  scale_y_continuous(NULL, breaks = NULL)

Geometric Objects: geom_histogram

In the code chunk below, geom_histogram() is used to a simple histogram by using values in MATHS field of exam_data.

ggplot(data=exam_data,
aes(x = MATHS)) +
geom_histogram()

Modifying a geometric object by changing geom()

In the code chunk below,

  • bins argument is used to change the number of bins to 20,
  • fill argument is used to shade the histogram with light blue color, and color argument is used to change the outline colour of the bars in black
ggplot(data=exam_data, 
       aes(x= MATHS)) + 
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

Modifying a geometric object by changing aes()

  • The code chunk below changes the interior colour of the histogram (i.e. fill) by using sub- group of aesthetic().
  • Note that this approach can be used to colour, fill and alpha of the geometric.
ggplot(data=exam_data,
aes(x= MATHS,
fill = GENDER)) +
geom_histogram(bins=20,
color="grey30")

Geometric Objects: geom-density

geom-density() computes and plots kernel density estimate, which is a smoothed version of the histogram.

It is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.

The code below plots the distribution of Maths scores in a kernel density estimate plot.

ggplot(data=exam_data, 
       aes(x = MATHS)) + 
  geom_density()

The code chunk below plots two kernel density lines by using colour or fill arguments of aes()

ggplot(data=exam_data, 
       aes(x = MATHS, colour = GENDER)) + 
  geom_density()

Geometric Objects: geom_boxplot

The code chunk below plots boxplots by using geom_boxplot().

ggplot(data=exam_data,
aes(y = MATHS,
x= GENDER)) +
geom_boxplot()

Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different. The code chunk below plots the distribution of Maths scores by gender in notched plot instead of boxplot.

ggplot(data=exam_data, 
       aes(y = MATHS, x= GENDER)) + 
  geom_boxplot(notch=TRUE)

geom objects can be combined

The code chunk below plots the data points on the boxplots by using both geom_boxplot() and geom_point().

ggplot(data=exam_data,
      aes(y = MATHS,
      x= GENDER)) +
  geom_boxplot() +
  geom_point(position="jitter",
size = 0.5)

Geometric Objects: geom_violin

Violin plots are a way of comparing multiple data distributions. With ordinary density curves, it is difficult to compare more than just a few distributions because the lines visually interfere with each other. With a violin plot, it’s easier to compare several distributions since they’re placed side by side. The code below plot the distribution of Maths score by gender in violin plot.

ggplot(data=exam_data,
aes(y = MATHS,
x= GENDER)) +
geom_violin()

Geometric Objects: geom_violin() and geom_boxplot()

The code chunk below combined a violin plot and a boxplot to show the distribution of Maths scores by gender.

ggplot(data=exam_data,
aes(y = MATHS,
x= GENDER)) +
geom_violin(fill="light blue") +
geom_boxplot(alpha=0.5)

Geometric Objects: geom_point()

The code chunk below plots a scatterplot showing the Maths and English grades of pupils by using geom_point().

ggplot(data=exam_data,
aes(x= MATHS,
y=ENGLISH)) +
geom_point()

Statistics, stat

Working with stat - the stat_summary() method

ggplot(data=exam_data, 
       aes(y = MATHS, x= GENDER)) + 
  geom_boxplot() + 
  stat_summary(geom = "point", 
               fun.y="mean", 
               colour ="red", 
               size=4)

Working with stat - the geom() method

The code chunk below adding mean values by using geom_() function and overriding the default stat.

ggplot(data=exam_data, 
       aes(y = MATHS, x= GENDER)) + 
  geom_boxplot() + 
  geom_point(stat="summary", 
             fun.y="mean", 
             colour ="red", 
             size=4)

How to add a best fit curve on a scatterplot?

In the code chunk below, geom_smooth() is used to plot a best fit curve on the scatterplot. - The default method used is loess.

ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) +
  geom_point() + 
  geom_smooth(size=0.5)

The default smoothing method can be overridden as shown below.

ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) + 
  geom_point() + 
  geom_smooth(method=lm, size=0.5)

Facets

  • Facetting generates small multiples (sometimes also called trellis plot), each displaying a different subset of the data.
  • Facets are an alternative to aesthetics for displaying additional discrete variables.
  • ggplot2 supports two types of factes, namely: facet_grid and facet_wrap.

facet_wrap()

  • facet_wrap wraps a 1d sequence of panels into 2d.
  • This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

Working with facet_wrap()

The code chunk below plots a trellis plot using facet-wrap().

ggplot(data=exam_data, 
       aes(x= MATHS)) + 
  geom_histogram(bins=20) + 
  facet_wrap(~ CLASS)

facet_grid() function

  • facet_grid forms a matrix of panels defined by row and column facetting variables.
  • It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

Working with facet_grid()

The code chunk below plots a trellis plot using facet_grid().

ggplot(data=exam_data, 
       aes(x= MATHS)) + 
  geom_histogram(bins=20) + 
  facet_grid(~ CLASS)

Facetting

Plot a trellis boxplot looks similar to the figure below.

ggplot(data=exam_data, 
       aes(y = MATHS, x= CLASS)) + 
  geom_boxplot() + 
  facet_grid(~ GENDER)

Plot a trellis boxplot looks similar to the figure below.

ggplot(data=exam_data, 
       aes(y = MATHS, x= CLASS)) + 
  geom_boxplot() + 
  facet_grid(GENDER ~.)

Plot a trellis boxplot looks similar to the figure below.

ggplot(data=exam_data, 
       aes(y = MATHS, x= GENDER)) + 
  geom_boxplot() + 
  facet_grid(GENDER ~ CLASS)

Coordinates

Working with Coordinate

By the default, the bar chart of ggplot2 is in vertical form. The code chunk below flips the horizontal bar chart into vertical bar chart by using coord_flip().

ggplot(data=exam_data, 
       aes(x=RACE)) + 
  geom_bar() + 
  coord_flip()

How to change to the y- and x-axis range?

The code chunk below fixed both the y-axis and x-axis range from 0-100.

ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) + 
  geom_point() + 
  geom_smooth(method=lm, size=0.5) + 
  coord_cartesian(xlim=c(0,100), 
                  ylim=c(0,100))

Themes

Working with theme

The code chunk below plot a horizontal bar chart using theme_gray().

ggplot(data=exam_data, 
       aes(x=RACE)) + 
  geom_bar() + 
  coord_flip() + 
  theme_gray()

A horizontal bar chart plotted using theme_classic().

ggplot(data=exam_data, 
       aes(x=RACE)) + 
  geom_bar() + 
  coord_flip() + 
  theme_classic()

A horizontal bar chart plotted using theme_minimal().

ggplot(data=exam_data, 
       aes(x=RACE)) + 
  geom_bar() + 
  coord_flip() + 
  theme_minimal()

Plot a horizontal bar chart looks similar to the figure below. - Changing the colors of plot panel background of theme_minimal to lightblue and the color of grid lines to white.

ggplot(data=exam_data, aes(x=RACE)) + 
  geom_bar() + 
  coord_flip() + 
  theme_minimal() + 
  theme(panel.background = element_rect(fill = "lightblue", 
                                        colour = "lightblue", 
                                        size = 0.5, 
                                        linetype = "solid"), 
        panel.grid.major = element_line(size = 0.5, 
                                        linetype = 'solid', 
                                        colour = "white"), 
        panel.grid.minor = element_line(size = 0.25, 
                                        linetype = 'solid', 
                                        colour = "white"))

Designing Data-drive Graphics for Analysis I

The original design

A simple vertical bar chart for frequency analysis. Critics:

  • y-aixs label is not clear (i.e. count)
  • To support effective comparison, the bars should be sorted by their resepctive frequencies.
  • For static graph, frequency values should be added to provide addition information.

image

The makeover design

The code chunk.

ggplot(data=exam_data, 
       aes(x=reorder(RACE,RACE, function(x)-length(x))))+ 
  geom_bar() + 
  ylim(0,220) + 
  geom_text(stat="count", 
            aes(label=paste0(..count.., ", ", 
                             round(..count../sum(..count..)*100, 1), "%")), 
            vjust=-1) + 
  xlab("Race") + 
  ylab("No. of\nPupils") + 
  theme(axis.title.y=element_text(angle = 0))

This code chunk uses fct_infreq() of forcats package.

exam_data %>% 
  mutate(RACE = fct_infreq(RACE)) %>% 
  ggplot(aes(x = RACE)) + 
  geom_bar()+ ylim(0,220) + 
  geom_text(stat="count", 
            aes(label=paste0(..count.., ", ", 
                             round(..count../sum(..count..)*100, 1), "%")), 
            vjust=-1) + 
  xlab("Race") + 
  ylab("No. of\nPupils") + 
  theme(axis.title.y=element_text(angle = 0))

Designing Data-drive Graphics for Analysis II

The original design

image

The makeover design

  • Adding mean and median lines on the histogram plot.
  • Change fill color and line color

The code chunk

ggplot(data=exam_data, 
       aes(x= MATHS)) + 
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") + 
  geom_vline(aes(xintercept=mean(MATHS, na.rm=T)), 
             color="red", 
             linetype="dashed", 
             size=1) + 
  geom_vline(aes(xintercept=median(MATHS, na.rm=T)), 
             color="grey30", 
             linetype="dashed", 
             size=1)

Designing Data-drive Graphics for Analysis III

The histograms are elegantly designed but not informative. This is because they only reveal the distribution of English scores by gender but without context such as all pupils.

image

The makeover histograms are not only elegantly designed but also informative. This is because they reveal the distribution of English scores by gender with reference to all pupils.

The code chunk below is used to create the makeover design on the right. Note that the second line is used to create the so called Background Data - full without the 3th column (GENDER).

d <- exam_data 
d_bg <- d[, -3] 
ggplot(d, aes(x = ENGLISH, 
              fill = GENDER)) + 
  geom_histogram(data = d_bg, 
                 fill = "grey", 
                 alpha = .5) + 
  geom_histogram(colour = "black") + 
  facet_wrap(~ GENDER) + 
  guides(fill = FALSE) + 
  theme_bw()

Designing Data-drive Graphics for Analysis IV

The orginal design

image

The code chunk used to create the makeover.

ggplot(data=exam_data, 
       aes(x=MATHS, y=ENGLISH)) + 
  geom_point() + 
  coord_cartesian(xlim=c(0,100), 
                  ylim=c(0,100)) + 
  geom_hline(yintercept=50, 
             linetype="dashed", 
             color="grey60", 
             size=1) + 
  geom_vline(xintercept=50, 
             linetype="dashed", 
             color="grey60", 
             size=1)

Beyond Basic Statistical Graphic

Split violin plots

devtools::install_github("psyteachr/introdataviz")
ggplot(exam_data, 
       aes(x = RACE, 
           y = MATHS, 
           fill = GENDER)) + 
  introdataviz::geom_split_violin(alpha = .4, 
                                  trim = FALSE) + 
  geom_boxplot(width = .2, 
               alpha = .6, 
               fatten = NULL, 
               show.legend = FALSE) + 
  stat_summary(fun.data = "mean_se", 
               geom = "pointrange", 
               show.legend = F, 
               position = position_dodge(.175)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), 
                      limits = c(0, 100)) + 
  scale_fill_brewer(palette = "Dark2", 
                    name = "Language group")

rainclound plots

This hands-on exercise introduces ggdist package. You will learn how to create raincloud plots as shown on Slide 23 of Lesson 1. - First, stat_halfeye() of ggdist package is used to create a half violin plot on the right of the vertical axis.

ggplot(exam_data, 
       aes(x = RACE, 
           y = MATHS)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), 
                     limits = c(0, 100)) + 
  stat_halfeye(adjust = .33, 
               width = .67, 
               color = NA, 
               justification = -0.01, 
               position = position_nudge( x = .15) )

  • Next, stat_dots() of ggdist package is used to create the dot plots on the left.
ggplot(exam_data, 
       aes(x = RACE, 
           y = MATHS)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), 
                     limits = c(0, 100)) + 
  stat_halfeye(adjust = .33, 
               width = .67, 
               color = NA, 
               justification = -.01, 
               position = position_nudge( x = .15) ) + 
  stat_dots(side = "left", 
            justification = 1.1, 
            binwidth = .25, 
            dotsize = 5)

  • Lastly, coord_flip() of ggplot2 is used to rotate the vertical raincloud plots into horizontal raincloud plots.
ggplot(exam_data, 
       aes(x = RACE, y = MATHS)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), 
                     limits = c(0, 100)) + 
  stat_halfeye(adjust = .33, 
               width = .67, 
               color = NA, 
               justification = -.01, 
               position = position_nudge( x = .15) ) + 
  stat_dots(side = "left", 
            justification = 1.1, 
            binwidth = .25, 
            dotsize = 5) + 
  coord_flip()

In this alternative design, boxplots are added by using geom_boxplot() of ggplot2.

ggplot(exam_data, 
       aes(x = RACE, 
           y = MATHS)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), 
                     limits = c(0, 100)) + 
  stat_halfeye(adjust = .33, 
               width = .67, 
               color = NA, 
               justification = -.01, 
               position = position_nudge( x = .15) ) + 
  geom_boxplot( width = .25, 
                outlier.shape = NA ) + 
  stat_dots(side = "left", 
            justification = 1.2, 
            binwidth = .25, 
            dotsize = 5) + 
  coord_flip()

ridge plot

This hands-on exercise introduces ggridge, an ggplot2 extension specially designed to create ridge plot. ggridges package provides two main geoms, namely:

geom_ridgeline and geom_density_ridges. The former takes height values directly to draw ridgelines, and the latter first estimates data densities and then draws those using ridgelines.

The code chunk below uses geom_density_ridges() to create a basic ridge density plot.

library(ggridges)
ggplot(exam_data, 
       aes(x = MATHS, 
           y = CLASS)) + 
  geom_density_ridges()

  • Trailing tails can be cut off using the rel_min_height aesthetic. This aesthetic sets a percent cutoff relative to the highest point of any of the density curves. A value of 0.01 usually works well, but you may have to modify this parameter for different datasets.
library(ggridges)
ggplot(exam_data, 
       aes(x = MATHS, 
           y = CLASS)) + 
  geom_density_ridges(rel_min_height = 0.01)

  • The scale parameter control the extent to which the different densities overlap. A setting of scale=1 for example, means the tallest density curve just touches the baseline of the next higher one. Smaller values create a separation between the curves, and larger values create more overlap.
library(ggridges)
ggplot(exam_data, 
       aes(x = MATHS, 
           y = CLASS)) + 
  geom_density_ridges(rel_min_height = 0.01)

  • The scale parameter control the extent to which the different densities overlap. A setting of scale=1 for example, means the tallest density curve just touches the baseline of the next higher one. Smaller values create a separation between the curves, and larger values create more overlap.
library(ggridges)
ggplot(exam_data, 
       aes(x = MATHS, 
           y = CLASS)) + 
  geom_density_ridges(rel_min_height = 0.01, 
                      scale = 1)

  • ggridges package provides a stat stat_density_ridges that replaces stat_density in the context of ridgeline plots.

In the code chunk below, stat_density_ridges() is used to create probability ridge plot.

ggplot(exam_data, 
       aes(x = MATHS, y = CLASS,
           fill = 0.5 - abs(0.5 - stat(ecdf)))) +
  stat_density_ridges(
    geom = "density_ridges_gradient",
    calc_ecdf = TRUE,
    rel_min_height = 0.001) +                      
  scale_fill_viridis_c(name = "Tail probability",
                       direction = -1)